Search for: All records

Creators/Authors contains: "Yuan, Jun"


TRIVEA: Transparent Ranking Interpretation using Visual Explanation of black-box Algorithmic rankers

https://doi.org/10.1007/s00371-023-03055-x

Yuan, Jun; Bhattacharjee, Kaustav; Islam, Akm Zahirul; Dasgupta, Aritra (May 2024, The Visual Computer)

Ranking schemes drive many real-world decisions, such as where to study, whom to hire, and what to buy. Many of these decisions come with high consequences. For example, a university can be deemed less prestigious if not featured in a top-k list, and consumers might not even explore products that do not get recommended to buyers. At the heart of most of these decisions are opaque ranking schemes, which dictate the ordering of data entities, but their internal logic is inaccessible or proprietary. Drawing inferences about the ranking differences is like a guessing game to the stakeholders, such as the rankees (i.e., the entities who are ranked, like product companies) and the decision-makers (i.e., those who use the rankings, like buyers). In this paper, we aim to enable transparency in ranking interpretation by using algorithmic rankers that learn from available data and by enabling human reasoning about the learned ranking differences using explainable AI (XAI) methods. To realize this aim, we leverage the exploration–explanation paradigm of human–data interaction to let human stakeholders explore subsets and groupings of complex multi-attribute ranking data using visual explanations of model fit and attribute influence on rankings. We realize this explanation paradigm for transparent ranking interpretation in TRIVEA, a visual analytic system that is fueled by: (i) visualizations of model fit derived from algorithmic rankers that learn the associations between attributes and rankings from available data and (ii) visual explanations derived from XAI methods that help abstract important patterns, such as the relative influence of attributes in different ranking ranges. Using TRIVEA, end users not trained in data science have the agency to transparently reason about the global and local behavior of the rankings without the need to open black-box ranking models and to develop confidence in the resulting attribute-based inferences. We demonstrate the efficacy of TRIVEA using multiple usage scenarios and subjective feedback from researchers with diverse domain expertise.
Full Text Available
A Human-in-the-loop Workflow for Multi-Factorial Sensitivity Analysis of Algorithmic Rankers

https://doi.org/10.1145/3597465.3605221

Yuan, Jun; Dasgupta, Aritra (June 2023, HILDA '23: Proceedings of the Workshop on Human-In-the-Loop Data Analytics)

Algorithmic rankers are ubiquitously applied in automated decision systems such as hiring, admission, and loan-approval systems. Without appropriate explanations, decision-makers often cannot audit or trust algorithmic rankers' outcomes. In recent years, XAI (explainable AI) methods have focused on classification models, but for algorithmic rankers, we are yet to develop state-of-the-art explanation methods. Moreover, explanations are also sensitive to changes in data and ranker properties, and decision-makers need transparent model diagnostics for calibrating the degree and impact of ranker sensitivity. To fulfill these needs, we take a dual approach of: i) designing explanations by transforming Shapley values for the simple form of a ranker based on linear weighted summation and ii) designing a human-in-the-loop sensitivity analysis workflow by simulating data whose attributes follow user-specified statistical distributions and correlations. We leverage a visualization interface to validate the transformed Shapley values and draw inferences from them by leveraging multi-factorial simulations, including data distributions, ranker parameters, and rank ranges.
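As a rough illustration of Shapley values for a ranker based on linear weighted summation, the sketch below uses the standard closed form for a linear model, where each attribute's contribution is its weight times its deviation from the attribute mean. This is not the transformation described in the paper; the data, weights, and function names are all hypothetical, and the closed form assumes attributes are treated as independent.

```python
from statistics import mean

def linear_scores(items, weights):
    """Score each item as the weighted sum of its attribute values."""
    return [sum(w * x for w, x in zip(weights, row)) for row in items]

def linear_shapley(items, weights):
    """Closed-form Shapley values for a linear scorer, using the attribute
    means as the baseline: phi_j = w_j * (x_j - mean_j)."""
    n_attrs = len(weights)
    baselines = [mean(row[j] for row in items) for j in range(n_attrs)]
    return [
        [weights[j] * (row[j] - baselines[j]) for j in range(n_attrs)]
        for row in items
    ]

# Hypothetical data: three items, two attributes.
items = [[3.0, 1.0], [1.0, 4.0], [2.0, 2.0]]
weights = [0.7, 0.3]

scores = linear_scores(items, weights)
phi = linear_shapley(items, weights)

# Rank items by descending score (index 0 is the top-ranked item).
ranking = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(f"item {i}: score={scores[i]:.2f}, contributions={phi[i]}")
```

In this form the contributions of an item sum to the gap between its score and the mean score, which is one way to read why an item sits above or below the middle of a ranking.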
Full Text Available
Introducing contextual transparency for automated decision systems

https://doi.org/10.1038/s42256-023-00623-7

Sloane, Mona; Solano-Kamaiko, Ian René; Yuan, Jun; Dasgupta, Aritra; Stoyanovich, Julia (March 2023, Nature Machine Intelligence)

Full Text Available
mTSeer: Interactive Visual Exploration of Models on Multivariate Time-series Forecast

https://doi.org/10.1145/3411764.3445083

Xu, Ke; Yuan, Jun; Wang, Yifang; Silva, Claudio; Bertini, Enrico (May 2021, Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems)

Full Text Available
BetrFS: a compleat file system for commodity SSDs

https://doi.org/10.1145/3492321.3519571

Jiao, Yizheng; Bertron, Simon; Patel, Sagar; Zeller, Luke; Bennett, Rory; Mukherjee, Nirjhar; Bender, Michael A.; Condict, Michael; Conway, Alex; Farach-Colton, Martín; et al (March 2022, BetrFS: a compleat file system for commodity SSDs)

Full Text Available
External-memory Dictionaries in the Affine and PDAM Models

https://doi.org/10.1145/3470635

Bender, Michael A.; Conway, Alex; Farach-Colton, Martín; Jannen, William; Jiao, Yizheng; Johnson, Rob; Knorr, Eric; McAllister, Sara; Mukherjee, Nirjhar; Pandey, Prashant; et al (September 2021, ACM Transactions on Parallel Computing)
Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bε-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bε-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and Bε-trees. Furthermore, the models predict that B-trees are highly sensitive to variations in the node size, whereas Bε-trees are much less sensitive. These predictions are borne out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, Bε-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bε-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently. We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.
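The factor-of-2 claim about the half-bandwidth point can be checked numerically with a small sketch. It assumes an affine cost of the form "fixed setup time plus size divided by bandwidth" for one contiguous IO, and charges the DAM the affine cost of one block per block transferred. The device parameters (10 ms setup, 100 bytes per microsecond) and all function names are hypothetical and are not taken from the article's benchmarks.

```python
def affine_cost_us(size_bytes, setup_us, bytes_per_us):
    """Affine cost: fixed setup time plus transfer time proportional to size."""
    return setup_us + size_bytes / bytes_per_us

def dam_cost_us(size_bytes, block_bytes, block_cost_us):
    """DAM cost: every block transfer is charged the same unit cost."""
    blocks = -(-size_bytes // block_bytes)  # ceiling division on integers
    return blocks * block_cost_us

# Hypothetical device: 10 ms setup, 100 bytes per microsecond (100 MB/s).
SETUP_US = 10_000
BYTES_PER_US = 100

# Half-bandwidth point: the IO size whose transfer time equals the setup time.
HALF_BANDWIDTH_BYTES = SETUP_US * BYTES_PER_US

# Unit cost of one DAM block, taken as the affine cost of one such block.
BLOCK_COST_US = affine_cost_us(HALF_BANDWIDTH_BYTES, SETUP_US, BYTES_PER_US)

io_sizes = [1, 4096, 65_536, 1_000_000, 1_000_001, 10_000_000, 123_456_789]
ratios = [
    dam_cost_us(x, HALF_BANDWIDTH_BYTES, BLOCK_COST_US)
    / affine_cost_us(x, SETUP_US, BYTES_PER_US)
    for x in io_sizes
]

for x, r in zip(io_sizes, ratios):
    print(f"{x:>11} bytes: DAM/affine = {r:.3f}")
```

Under these assumptions the DAM estimate stays between 1 and 2 times the affine cost for every IO size tried; choosing a block size far from the half-bandwidth point widens that gap.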
Full Text Available
Copy-on-Abundant-Write for Nimble File System Clones

Zhan, Yang; Conway, Alex; Jiao, Yizheng; Mukherjee, Nirjhar; Groombridge, Ian; Bender, Michael; Farach-Colton, Martin; Jannen, William; Johnson, Rob; Porter, Donald; et al (January 2021, ACM Transactions on Storage)

Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem. This article describes nimble clones in B-ε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3–4×.
Full Text Available
Copy-on-Abundant-Write for Nimble File System Clones

https://doi.org/10.1145/3423495

Zhan, Yang; Conway, Alex; Jiao, Yizheng; Mukherjee, Nirjhar; Groombridge, Ian; Bender, Michael A.; Farach-Colton, Martin; Jannen, William; Johnson, Rob; Porter, Donald E.; et al (February 2021, ACM Transactions on Storage)
Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem. This article describes nimble clones in B-ε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3–4×.
Full Text Available
External-Memory Dictionaries in the Affine and PDAM Models

https://doi.org/10.1145/3323165.3323210

Bender, Michael; Conway, Alex; Farach-Colton, Martin; Jannen, William; Jiao, Yizheng; Johnson, Rob; Knorr, Eric; McAllister, Sara; Mukherjee, Nirjhar; Pandey, Prashant; et al (January 2021, ACM Transactions on Parallel Computing)

Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bε-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bε-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and Bε-trees. Furthermore, the models predict that B-trees are highly sensitive to variations in the node size, whereas Bε-trees are much less sensitive. These predictions are borne out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, Bε-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bε-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently. We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.
Full Text Available
File Systems Fated for Senescence? Nonsense, Says Science

Conway, Alex; Bakshi, Ainesh; Jiao, Yizheng; Zhan, Yang; Bender, Michael; Jannen, William; Johnson, Rob; Kuszmaul, Bradley; Porter, Donald; Yuan, Jun; et al (January 2017, USENIX FAST)

File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions which eventually lead to slower performance, or aging. Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem. However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. For example, on ext4 and ZFS, a few hundred git pull operations can reduce read performance by a factor of 2; performing a thousand pulls can reduce performance by up to a factor of 30. We further present microbenchmarks demonstrating that common placement strategies are extremely sensitive to file-creation order; varying the creation order of a few thousand small files in a real-world directory structure can slow down reads by 15–175×, depending on the file system. We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between read performance of a directory scan and the locality within a file system’s access patterns, using a dynamic layout score. In short, many file systems are exquisitely prone to read aging for a variety of write workloads. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. BetrFS typically outperforms the other file systems in our benchmarks; aged BetrFS even outperforms the unaged versions of these file systems, excepting Btrfs. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.
Full Text Available
